GENTLE: a novel bioinformatics tool for generating features and building classifiers from T cell repertoire cancer data

Background In the global effort to discover biomarkers for cancer prognosis, prediction tools have become essential resources. TCR (T cell receptor) repertoires contain important features that differentiate healthy controls from cancer patients, or that differentiate outcomes for patients being treated with different drugs. Accordingly, tools that can easily and quickly generate and identify important features from TCR repertoire data, and build accurate classifiers to predict future outcomes, are essential.

Results This paper introduces GENTLE (GENerator of T cell receptor repertoire features for machine LEarning): an open-source, user-friendly web-application tool that allows TCR repertoire researchers to discover important features, to create classifier models and evaluate them with metrics, and to quickly generate visualizations for data interpretation. We performed a case study with repertoires of TRegs (regulatory T cells) and TConvs (conventional T cells) from healthy controls versus patients with breast cancer. We showed that diversity features were able to distinguish between the groups. Moreover, the classifiers built with these features could correctly classify samples (‘Healthy’ or ‘Breast Cancer’) from the TRegs repertoire when trained with the TConvs repertoire, and from the TConvs repertoire when trained with the TRegs repertoire.

Conclusion The paper walks through installing and using GENTLE and presents a case study and results to demonstrate the application’s utility. GENTLE is intended for any researcher who works with TCR repertoire data and aims to discover predictive features from these data and build accurate classifiers. GENTLE is available at https://github.com/dhiego22/gentle and https://share.streamlit.io/dhiego22/gentle/main/gentle.py.

Supplementary Information The online version contains supplementary material available at 10.1186/s12859-023-05155-w.

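Richness is the total number of types (here, distinct clones) in a repertoire:

$$R = \text{total number of types or species} \qquad (1)$$

Shannon entropy index quantifies the uncertainty (entropy, or degree of surprise) of information [1]. The Shannon entropy of each individual's repertoire is calculated using all CDR3β sequences:

$$H = -\sum_{i=1}^{R} p_i \log p_i$$

where $p_i$ is the probability of each individual clone $i$, and $R$ is the richness.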
Simpson index is the probability that two entities sampled at random from a set (sampling with replacement) belong to the same type [2]:

$$\lambda = \sum_{i=1}^{R} p_i^{2}$$
Inverse Simpson index is the effective number of types obtained when the weighted arithmetic mean is used to quantify the average proportional abundance of types in the dataset of interest. It is used to represent repertoire diversity with emphasis on high-frequency reads:

$$\text{Inv. Simpson} = \frac{1}{\sum_{i=1}^{R} p_i^{2}}$$
Gini index measures the inequality among the values of a proportion distribution. A Gini coefficient of zero expresses perfect equality, where all types have the same probability. A Gini coefficient of one indicates maximal inequality among types.
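The source does not reproduce the Gini formula here, so the following is only a minimal sketch that assumes the common rank-based estimator on sorted values. The function name `gini_index` and the example frequencies are hypothetical.

```python
def gini_index(frequencies):
    """Gini index of non-negative clone frequencies or counts.

    Assumption: the rank-based estimator
    G = sum_i (2*i - n - 1) * x_i / (n * sum_i x_i), with x sorted ascending.
    """
    values = sorted(frequencies)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one non-zero value")
    # Ranks start at 1 for the smallest value.
    weighted = sum((2 * rank - n - 1) * v
                   for rank, v in enumerate(values, start=1))
    return weighted / (n * total)


# Hypothetical repertoires: one perfectly even, one dominated by a single clone.
even = gini_index([0.25, 0.25, 0.25, 0.25])
skewed = gini_index([0.0, 0.0, 0.0, 1.0])
```

With this estimator and a finite number of types n, the largest attainable value is (n − 1)/n, which approaches one as n grows.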
Pielou's index, or clonal evenness, is the ratio between the Shannon entropy and its maximum possible value for the given number of types [3]. It measures how similar the numbers of individuals are across the different types in a particular environment:

$$\text{Pielou} = \frac{H}{\log R}$$

where $H$ is the Shannon entropy and $R$ is the richness.
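The diversity indices described so far can all be computed from the clone frequencies of one repertoire. The sketch below assumes natural logarithms (the source does not state the base) and uses hypothetical helper names and made-up CDR3β sequences.

```python
import math
from collections import Counter


def clone_frequencies(cdr3_sequences):
    """Proportion p_i of each distinct clone in a list of CDR3 sequences."""
    counts = Counter(cdr3_sequences)
    total = sum(counts.values())
    return [c / total for c in counts.values()]


def richness(p):
    return len(p)


def shannon(p):
    # Natural logarithm is an assumption; the base is not given in the text.
    return -sum(x * math.log(x) for x in p if x > 0)


def simpson(p):
    return sum(x * x for x in p)


def inverse_simpson(p):
    return 1.0 / simpson(p)


def pielou(p):
    r = richness(p)
    # Evenness is undefined for a single type; return 0.0 by convention here.
    return shannon(p) / math.log(r) if r > 1 else 0.0


# Made-up repertoire: one clone seen twice, two clones seen once.
repertoire = ["CASSLGQAYEQYF", "CASSLGQAYEQYF", "CASSPGTGGYEQYF", "CASSQDRGNTEAFF"]
p = clone_frequencies(repertoire)
indices = {
    "richness": richness(p),
    "shannon": shannon(p),
    "simpson": simpson(p),
    "inverse_simpson": inverse_simpson(p),
    "pielou": pielou(p),
}
```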
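Hill numbers, or effective numbers of species, are a family of diversity indices in which the importance of the abundance distribution increases with increasing Hill order [4]:

$${}^{q}D = \frac{1}{M_{q-1}} = \left(\sum_{i=1}^{R} p_i^{q}\right)^{1/(1-q)}$$

where $M_{q-1}$ is the average proportional abundance of types in the dataset (the weighted generalized mean of the $p_i$ with exponent $q-1$), $p_i$ is the proportion of type $i$, and $q$ is the Hill order. The Hill number with q = 0 is the richness; in the limit q → 1 it is the exponential of the Shannon entropy; and for q = 2 it is the inverse Simpson index.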
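As a quick check of the special cases of the Hill order, here is a minimal sketch, assuming natural logarithms and treating q = 1 as the limiting case. The function name `hill_number` and the example proportions are hypothetical.

```python
import math


def hill_number(p, q):
    """Hill number of order q for a list of type proportions p."""
    p = [x for x in p if x > 0]
    if q == 1:
        # The general formula is undefined at q = 1; its limit is the
        # exponential of the Shannon entropy (natural log assumed).
        return math.exp(-sum(x * math.log(x) for x in p))
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))


p = [0.5, 0.25, 0.25]            # hypothetical clone proportions
d0 = hill_number(p, 0)           # richness
d1 = hill_number(p, 1)           # exp(Shannon entropy)
d2 = hill_number(p, 2)           # inverse Simpson index
```

Because higher orders weight abundant types more heavily, the values are non-increasing in q for any fixed distribution.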

The network metrics
All network metrics are implemented with the Python library for complex networks [5].
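Levenshtein distance, also known as edit distance, is the minimum number of edits (substitutions, insertions, and deletions) necessary to transform one string into another [6]. In the TCR context, it is the number of mutations needed to convert one sequence of amino acids into another:

$$\operatorname{lev}(a,b) = \begin{cases} |a| & \text{if } |b| = 0, \\ |b| & \text{if } |a| = 0, \\ \operatorname{lev}(\operatorname{tail}(a), \operatorname{tail}(b)) & \text{if } a[0] = b[0], \\ 1 + \min \begin{cases} \operatorname{lev}(\operatorname{tail}(a), b) \\ \operatorname{lev}(a, \operatorname{tail}(b)) \\ \operatorname{lev}(\operatorname{tail}(a), \operatorname{tail}(b)) \end{cases} & \text{otherwise,} \end{cases}$$

where a and b are sequences of amino acids, x[n] is the n-th amino acid of the sequence x (with x[0] the first), and tail(x) is the sequence x without the first amino acid.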
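The recursive definition of the edit distance translates almost directly into code. The sketch below is illustrative only: it memoizes on the start positions of the two tails rather than copying strings, and the CDR3 sequences are made up.

```python
from functools import lru_cache


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two amino-acid sequences."""

    @lru_cache(maxsize=None)
    def lev(i: int, j: int) -> int:
        # (i, j) stands for the pair of tails a[i:] and b[j:].
        if i == len(a):
            return len(b) - j          # only insertions remain
        if j == len(b):
            return len(a) - i          # only deletions remain
        if a[i] == b[j]:
            return lev(i + 1, j + 1)   # first residues match, no cost
        return 1 + min(
            lev(i + 1, j),             # deletion from a
            lev(i, j + 1),             # insertion into a
            lev(i + 1, j + 1),         # substitution
        )

    return lev(0, 0)


# Two made-up CDR3 sequences that differ by a single substitution (A -> G).
distance = levenshtein("CASSLGQAYEQYF", "CASSLGQGYEQYF")
```

Memoization keeps the cost proportional to the product of the two sequence lengths, which is small for typical CDR3 lengths.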
Density is the ratio between the number of edges in a graph and the maximum number of edges that the graph can contain. For an undirected graph:

$$d = \frac{2m}{n(n-1)}$$

where n is the number of nodes and m is the number of edges in the graph G.
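Clustering coefficient of a node is defined as the probability that two randomly selected neighbors of that node are connected to each other. The average clustering coefficient of the graph is:

$$C = \frac{1}{n} \sum_{v \in G} c_v$$

where $c_v$ is the clustering coefficient of node $v$ and n is the number of nodes in the graph G.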
Transitivity is the fraction of all possible triangles that are present in a graph G:

$$T = 3\,\frac{\#\text{triangles}}{\#\text{triads}} \qquad (12)$$

where # is the cardinality of a set, and the triads are all possible triangles, each given by two edges with a shared vertex.
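To make the three graph metrics concrete, the following sketch computes them on a small undirected graph stored as a dictionary of neighbor sets. It is a plain-Python illustration of the definitions, not the library code used by the tool, and the node labels are hypothetical.

```python
from itertools import combinations


def build_graph(edges):
    """Undirected graph as {node: set(neighbors)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def density(adj):
    n = len(adj)
    m = sum(len(nb) for nb in adj.values()) // 2   # each edge is stored twice
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0


def connected_neighbor_pairs(adj, v):
    """Number of pairs of neighbors of v that are themselves linked."""
    return sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])


def local_clustering(adj, v):
    k = len(adj[v])
    if k < 2:
        return 0.0
    return 2 * connected_neighbor_pairs(adj, v) / (k * (k - 1))


def average_clustering(adj):
    return sum(local_clustering(adj, v) for v in adj) / len(adj)


def transitivity(adj):
    # Summing over nodes counts every triangle three times (once per corner),
    # which supplies the factor 3 in the definition.
    closed = sum(connected_neighbor_pairs(adj, v) for v in adj)
    triads = sum(len(nb) * (len(nb) - 1) // 2 for nb in adj.values())
    return closed / triads if triads else 0.0


# Hypothetical network: a triangle (s1, s2, s3) plus a pendant node s4.
g = build_graph([("s1", "s2"), ("s2", "s3"), ("s1", "s3"), ("s3", "s4")])
metrics = (density(g), average_clustering(g), transitivity(g))
```

In a repertoire network the nodes would typically be CDR3 sequences, but how edges are drawn between them is not specified in this section, so the edge list above is purely illustrative.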

The Motif metrics
k-mers are all the possible occurrences of substrings of k contiguous amino acids, for k = {2, 3, 4}. A sequence of length L contains L − k + 1 k-mers, out of a total of $n^k$ possible k-mers, where n is the number of possible amino acid monomers.
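The k-mer counts can be illustrated with a short sketch. The helper names are hypothetical, the sequence is made up, and how the counts are later assembled into a feature matrix is not shown.

```python
from collections import Counter


def kmers(sequence, k):
    """All overlapping substrings of length k (L - k + 1 of them)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_counts(sequences, ks=(2, 3, 4)):
    """Occurrences of every k-mer across a list of sequences."""
    counts = Counter()
    for seq in sequences:
        for k in ks:
            counts.update(kmers(seq, k))
    return counts


cdr3 = "CASSLGQAYEQYF"                 # made-up CDR3 sequence, L = 13
three_mers = kmers(cdr3, 3)            # 13 - 3 + 1 = 11 substrings
counts = kmer_counts([cdr3, "CASSPGTGGYEQYF"])
possible_3mers = 20 ** 3               # n^k with n = 20 standard amino acids
```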
Principal component analysis (PCA) is a well-known method for projecting the data onto a lower-dimensional space.
Singular value decomposition (SVD) performs dimensionality reduction on sparse matrices efficiently. Unlike PCA, it does not center the data before decomposition. It is also known as latent semantic analysis (LSA).
Independent Component Analysis (ICA) is performed with FastICA, a fast algorithm based on [9]. ICA reduces noise and dimensionality by maximizing a measure of non-Gaussianity, which favors the statistical independence of the estimated components.
T-distributed Stochastic Neighbor Embedding (TSNE) is a nonlinear dimensionality reduction method that models similar objects by nearby points and dissimilar objects by distant points [10]. TSNE minimizes the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and those of the high-dimensional data.
Uniform Manifold Approximation and Projection (UMAP) is a general non-linear dimension reduction technique that applies a theoretical framework based on Riemannian geometry and algebraic topology [8].
Isometric Mapping (ISOMAP) is a non-linear dimensionality reduction method [11]. It uses the k nearest neighbors to determine the neighbors of each point, builds a weighted graph that approximates geodesic distances, computes the shortest path between every pair of nodes, and calculates a lower-dimensional embedding using multidimensional scaling.

Preprocessing normalizations
MinMaxScaler transforms the features by scaling each feature to a given range:

$$X_{\text{scaled}}(i) = \frac{X(i) - \min(X(i))}{\max(X(i)) - \min(X(i))} \cdot (\text{range}_{\max} - \text{range}_{\min}) + \text{range}_{\min} \qquad (13)$$

where X is the matrix of data, i is each column/feature of the data, and range min and range max are the bounds of the given range of values.
Standardize transforms the features by removing the mean and scaling to unit variance:

$$Z = \frac{X - \mu}{\sigma}$$

where X is the matrix of data, µ is the mean of the training samples, and σ is the standard deviation of the training samples.
RobustScaler scales features using statistics that are robust to outliers. It removes the median and scales the data according to a quantile range, by default the interquartile range (IQR).
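The three normalizations differ mainly in how they react to outliers, which a single feature column with one extreme value makes visible. This is a standard-library sketch of the formulas, not the scikit-learn code itself. Two assumptions are made: the population standard deviation, and quartiles taken by linear interpolation. The function names are hypothetical, and each function fits and transforms the same column for brevity.

```python
import statistics


def min_max_scale(column, range_min=0.0, range_max=1.0):
    lo, hi = min(column), max(column)
    span = hi - lo
    if span == 0:
        return [range_min for _ in column]      # constant feature
    return [(x - lo) / span * (range_max - range_min) + range_min
            for x in column]


def standardize(column):
    mu = statistics.fmean(column)
    sigma = statistics.pstdev(column)           # population std (assumption)
    if sigma == 0:
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


def robust_scale(column):
    median = statistics.median(column)
    # Quartiles by linear interpolation between data points (assumption).
    q1, _, q3 = statistics.quantiles(column, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        return [x - median for x in column]
    return [(x - median) / iqr for x in column]


feature = [1.0, 2.0, 3.0, 4.0, 100.0]           # made-up values, one outlier
scaled_minmax = min_max_scale(feature)
scaled_standard = standardize(feature)
scaled_robust = robust_scale(feature)
```

With min-max scaling the outlier squeezes the four ordinary values into a narrow band near zero, whereas the robust scaling keeps them evenly spread around the median.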

The feature selection
Pearson calculates the correlation of each feature with the target label.
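A correlation-based ranking can be sketched as follows. Ranking by the absolute value of the coefficient is an assumption, since the text does not say how the correlations are turned into a selection. The feature names and values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sy = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0                       # constant column: no correlation
    return cov / (sx * sy)


def rank_features(features, labels):
    """Feature names sorted by |correlation with the label|, strongest first."""
    scores = {name: abs(pearson(col, labels)) for name, col in features.items()}
    return sorted(scores, key=scores.get, reverse=True)


labels = [0, 0, 1, 1]                    # e.g. 0 = Healthy, 1 = Breast Cancer
features = {                             # made-up feature values
    "shannon": [5.1, 4.9, 3.0, 3.2],
    "density": [0.2, 0.4, 0.3, 0.3],
}
ranking = rank_features(features, labels)
```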
SelectFromModel with Ridge uses a Logistic Regression base estimator with a ridge (L2) penalty to rank the features by the weights of their coefficients.
SelectFromModel with XGBoost uses an XGBoost estimator to rank the features by their feature importance.
min-Redundancy and Max-Relevance (mRMR) applies mutual information to select features that maximize the statistical dependency on the joint distribution of the target variable [12,13]. The maximum relevance for the feature set S, given the mutual information of feature $f_i$ with the k-class target c, is:

$$\max D(S, c), \qquad D = \frac{1}{|S|} \sum_{f_i \in S} I(f_i; c)$$

The minimum redundancy in the feature subset is given by the mutual information between the sample vectors of each pair of features $f_i$, $f_j$:

$$\min R(S), \qquad R = \frac{1}{|S|^{2}} \sum_{f_i, f_j \in S} I(f_i; f_j)$$

This work uses the Python implementation available at https://pypi.org/project/mrmr-selection/.
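The interplay between relevance and redundancy is easiest to see in a greedy selection on already discretized features. The sketch below only illustrates the idea: it assumes the difference criterion (relevance minus mean redundancy with the features already chosen), and the cited package may estimate both quantities differently. All names and data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    # p(a,b) * log(p(a,b) / (p(a) p(b))), written with raw counts.
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())


def mrmr(features, target, n_select):
    """Greedy mRMR with the difference criterion (an assumption)."""
    relevance = {name: mutual_information(col, target)
                 for name, col in features.items()}
    selected, remaining = [], set(features)

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(mutual_information(features[name], features[s])
                         for s in selected) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < n_select:
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Made-up binary features: f_a_copy duplicates f_a, f_c is independent of f_a.
features = {
    "f_a": [0, 0, 1, 1],
    "f_a_copy": [0, 0, 1, 1],
    "f_c": [0, 1, 0, 1],
}
target = [0, 1, 1, 1]                    # equals f_a OR f_c
chosen = mrmr(features, target, n_select=2)
```

All three features are equally relevant on their own, yet the duplicate is never chosen together with its original because its redundancy cancels its relevance.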

The classifiers
Gaussian Naive Bayes (GNB) is a probabilistic classification algorithm based on applying Bayes' theorem with strong independence assumptions [14].
Linear Discriminant Analysis (LDA) is a classifier with a linear decision boundary, obtained by fitting class-conditional densities to the data and applying Bayes' rule.
Logistic Regression (LR) is the well-known traditional classifier, used here with an L2 penalty term.
Decision tree (DT) is the well-known traditional classifier, used here with Gini impurity as the split criterion.

The scoring metrics of classifiers
The scoring metrics are implemented in Python using scikit-learn [7]. All score metrics listed here have their best value at 1 and their worst value at 0.
Accuracy is defined as the fraction of samples whose predicted label matches the true label:

$$\text{accuracy}(y, \hat{y}) = \frac{1}{n_{\text{samples}}} \sum_{i=0}^{n_{\text{samples}}-1} 1(\hat{y}_i = y_i)$$

where 1(x) is the characteristic function, y is the vector of target labels, and ŷ is the vector of predicted labels.
Precision, or positive predictive value, is the ratio of the correct positive predictions to all samples predicted as positive, that is, correct plus incorrect predictions of the positive class:

$$\text{precision} = \frac{tp}{tp + fp}$$

where tp is the number of true positives and fp the number of false positives.
Recall, or sensitivity, is the ratio of the correct positive predictions to all truly positive samples, that is, correct positive predictions plus incorrect predictions of the negative class:

$$\text{recall} = \frac{tp}{tp + fn}$$

where tp is the number of true positives and fn the number of false negatives.
F1 score is the harmonic mean of the precision and recall:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
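The four scores can be checked by hand on a small set of predictions. The sketch below is a plain-Python illustration of the definitions rather than the scikit-learn functions. It treats 'Breast Cancer' as the positive class, which is an assumption, and the labels are made up.

```python
def confusion_counts(y_true, y_pred, positive):
    """True positives, false positives and false negatives for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, fp, fn


def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def precision(y_true, y_pred, positive):
    tp, fp, _ = confusion_counts(y_true, y_pred, positive)
    return tp / (tp + fp) if tp + fp else 0.0


def recall(y_true, y_pred, positive):
    tp, _, fn = confusion_counts(y_true, y_pred, positive)
    return tp / (tp + fn) if tp + fn else 0.0


def f1_score(y_true, y_pred, positive):
    pr = precision(y_true, y_pred, positive)
    rc = recall(y_true, y_pred, positive)
    return 2 * pr * rc / (pr + rc) if pr + rc else 0.0


BC, H = "Breast Cancer", "Healthy"
y_true = [BC, BC, BC, BC, H, H, H, H]    # made-up ground truth
y_pred = [BC, BC, BC, H, BC, BC, H, H]   # made-up predictions: tp=3, fn=1, fp=2
scores = {
    "accuracy": accuracy(y_true, y_pred),
    "precision": precision(y_true, y_pred, BC),
    "recall": recall(y_true, y_pred, BC),
    "f1": f1_score(y_true, y_pred, BC),
}
```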